CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China, August 28-29, 2010

Authors

  • Le Sun
  • Keh-Jiann Chen
  • Qun Liu
Abstract

The authors propose that the current technology for Chinese word segmentation needs to change. So-called segmentation should be split into separate and distinct phases. First of all, we need to limit segmentation to the segmentation of Chinese characters instead of so-called Chinese words. In character segmentation, we extract all the information about each character. Then we start a phase called Chinese morphological processing (CMP). The first step of CMP is to combine the separate characters; it is followed by post-segmentation processing, which covers all sorts of repetitive structures, Chinese-style abbreviations, the recognition and processing of pseudo-OOVs, etc. Most of the post-segmentation processing may have to be done by rule-based subroutines, so we need to change the current corpus-based methodology by merging it with rule-based techniques.
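The phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `CharInfo`, `segment_characters`, `combine`, and `merge_reduplication` are hypothetical, the one-entry lexicon is a toy, greedy forward maximum matching is a stand-in for whatever combination method the paper uses, and the two reduplication rules (AA and AABB) are only one example of the "repetitive structures" handled in post-segmentation processing.

```python
from dataclasses import dataclass


@dataclass
class CharInfo:
    """Per-character record produced by the character segmentation phase (hypothetical)."""
    char: str
    index: int
    is_cjk: bool


def segment_characters(text):
    """Phase 1: split text into characters and record basic information for each."""
    return [CharInfo(ch, i, "\u4e00" <= ch <= "\u9fff") for i, ch in enumerate(text)]


def combine(chars, lexicon, max_len=4):
    """CMP step 1 (assumed method): greedy forward maximum matching against a lexicon."""
    words, i = [], 0
    while i < len(chars):
        for n in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(c.char for c in chars[i:i + n])
            # A single character is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def merge_reduplication(words):
    """CMP step 2 (one illustrative rule set): merge AABB and AA reduplications."""
    out, i = [], 0
    while i < len(words):
        a = words[i]
        if (len(a) == 1 and i + 2 < len(words)
                and len(words[i + 1]) == 2 and words[i + 1][0] == a
                and words[i + 2] == words[i + 1][1]):
            # A + AB + B -> AABB
            out.append(a + words[i + 1] + words[i + 2])
            i += 3
        elif len(a) == 1 and i + 1 < len(words) and words[i + 1] == a:
            # A + A -> AA
            out.append(a + a)
            i += 2
        else:
            out.append(a)
            i += 1
    return out


chars = segment_characters("他高高兴兴看看书")
words = combine(chars, lexicon={"高兴"})
print(words)                       # ['他', '高', '高兴', '兴', '看', '看', '书']
print(merge_reduplication(words))  # ['他', '高高兴兴', '看看', '书']
```

In this sketch the lexicon lookup alone splits 高高兴兴 into three pieces, and the rule-based step repairs it. That is the kind of case where the abstract argues rule-based subroutines are needed alongside corpus-based methods. The AA rule here is deliberately naive and would over-merge in real text.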


Similar resources

CRF-based Experiments for Cross-Domain Chinese Word Segmentation at CIPS-SIGHAN-2010

This paper describes our experiments on the cross-domain Chinese word segmentation task at the first CIPS-SIGHAN Joint Conference on Chinese Language Processing. Our system is based on the Conditional Random Fields (CRFs) model. Considering the particular properties of the out-of-domain data, we propose some novel steps to get some improvements for the special task.


Chinese Personal Name Disambiguation Based on Person Modeling

This document presents the bakeoff results of Chinese personal name in the First CIPS-SIGHAN Joint Conference on Chinese Language Processing. The authors introduce the frame of the person disambiguation system LJPD, which uses a new person model. LJPD was built in a short time and was not given enough training and adjustment. Evaluation on LJPD shows that the precision is competitive, but the reca...


SIR-NERD: A Chinese Named Entity Recognition and Disambiguation System using a Two-Stage Method

This paper presents our SIR-NERD system for the Chinese named entity recognition and disambiguation task in the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2012). Our system uses a two-stage method and some key techniques to deal with the named entity recognition and disambiguation (NERD) task. Experimental results on the test data show that the proposed system, which incor...


Word Segmentation on Chinese Mirco-Blog Data with a Linear-Time Incremental Model

This paper describes the model we designed for the word segmentation bakeoff on Chinese micro-blog data in the 2nd CIPS-SIGHAN joint conference on Chinese language processing. We presented a linear-time incremental model for word segmentation where rich features including character-based features, word-based features as well as other possible features can be easily employed. We report the perfo...



Publication date: 2010